Digital Library and Archiving for Qatar
Abstract
Crawling and Indexing Qatari Scholarly Content: SeerQ

SeerSuite is a collection management system for digital libraries, developed at Penn State. It includes: 1) a Web crawler for scholarly articles; 2) a machine-learning-based system for automated extraction of metadata (title, abstract, author names/affiliations, citations); 3) a module for ingesting the extracted information into a database and Solr; and 4) a JSP-based front end for users. SeerQ reflects our modification of SeerSuite to address Qatari requirements. It uses both Heritrix and an in-house OAI-PMH-based crawler, which accesses digital repositories in Qatar that expose their metadata and content, especially QScience, a publisher in Doha focusing on scholarly content produced in Qatar. Other seeds for crawling were provided by the Qatar National Library and cover websites such as QCRI, Qatar University, and varied research establishments. We have crawled around 4,000 documents and ingested around 3,300. Metadata records with an author name, title, and citations are available through OAI-PMH (a minimal harvesting sketch follows this abstract).

Crawling and Searching: Lucidworks Fusion

Fusion is software by Lucidworks that collects and indexes documents using Apache Solr and provides utilities such as pipelines, connectors, and logging routines. We devised our own interface to suit the needs in Qatar, and used Fusion to build many Qatari collections covering online news resources, government activities, sports, etc. Fusion handles multilingual content; our collections are in Arabic and English. We fed Fusion the seed URLs for the resources, which it then collected and indexed. Our Qatari Arabic news article collection carries extra information, e.g., news article summaries we generated using machine-learning-based methods. Accordingly, we modified the Fusion configuration so there are extra fields in the schema file, and the interface to show both summary and article (see the indexing sketch below). Earlier we crawled 5,200 PDF news files using Heritrix, each file containing multiple news articles. We parsed the files, extracted the Arabic text, and ended up with roughly 120,000 news articles, for which we automatically generated summaries (a baseline summarization sketch appears below).

Crawling and Archiving at the Qatar National Library (QNL)

QNL is crawling and archiving Web content related to Qatar. The challenges of this task include:
1. Seeds: Obtaining proper seeds for such a large-scale crawl is very hard, but ICT Qatar (the governing body for the Internet in Qatar) has agreed to provide a list of .qa domains.
2. Dynamic content: As the host of the upcoming 2022 FIFA World Cup, Qatar has been much in the news, with rapidly changing dynamic content. It is difficult for curators to apply tools like Heritrix, especially to keep up with large volumes of changes as well as sitemaps and RSS feeds.
3. Identifying "Qatar related" content: Many websites where Qatar is discussed (e.g., soccer or Middle East forums) contain text unrelated to Qatar. It is a challenge to quickly identify all and only the relevant URLs (a simple filtering sketch follows).
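To make the OAI-PMH harvesting step concrete, here is a minimal sketch in Python. The endpoint URL is a placeholder, not an actual repository address, and this is not SeerQ's harvester code, which the abstract does not include; it simply walks ListRecords responses and follows resumption tokens, as any OAI-PMH client must.

    # Minimal OAI-PMH ListRecords harvester (sketch; not SeerQ's actual code).
    # BASE_URL is hypothetical; any OAI-PMH-compliant endpoint, such as one
    # exposed by a Qatari repository, could be substituted.
    import urllib.request
    import xml.etree.ElementTree as ET

    BASE_URL = "https://repository.example.qa/oai"   # hypothetical endpoint
    OAI = "{http://www.openarchives.org/OAI/2.0/}"
    DC = "{http://purl.org/dc/elements/1.1/}"

    def harvest(base_url, prefix="oai_dc"):
        """Yield (title, creators) pairs, following resumption tokens."""
        url = f"{base_url}?verb=ListRecords&metadataPrefix={prefix}"
        while url:
            with urllib.request.urlopen(url) as resp:
                root = ET.fromstring(resp.read())
            for record in root.iter(OAI + "record"):
                meta = record.find(OAI + "metadata")
                if meta is None:          # deleted records carry no metadata
                    continue
                title = meta.findtext(f".//{DC}title", default="")
                creators = [c.text for c in meta.iter(DC + "creator")]
                yield title, creators
            token = root.findtext(f".//{OAI}resumptionToken")
            url = (f"{base_url}?verb=ListRecords&resumptionToken={token}"
                   if token else None)   # empty/absent token ends the harvest

    for title, creators in harvest(BASE_URL):
        print(title, "--", "; ".join(filter(None, creators)))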
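The extra schema fields for summaries can be pictured as standard Solr additions behind Fusion. The sketch below uses hypothetical collection, field, and document names; the project's actual Fusion configuration is not given in the abstract. Solr's built-in text_ar field type provides Arabic text analysis.

    # Sketch: index a news article carrying an extra machine-generated
    # summary field into Solr. Collection and field names are hypothetical.
    #
    # Corresponding schema addition (one line per extra field):
    # <field name="summary_ar" type="text_ar" indexed="true" stored="true"/>
    import json
    import urllib.request

    SOLR_UPDATE = "http://localhost:8983/solr/qatar_news/update?commit=true"

    doc = {
        "id": "news-2013-06-01-017",   # hypothetical article id
        "title_ar": "...",             # Arabic article title
        "body_ar": "...",              # full Arabic article text
        "summary_ar": "...",           # auto-generated summary
        "lang": "ar",
    }

    req = urllib.request.Request(
        SOLR_UPDATE,
        data=json.dumps([doc]).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.status)   # 200 on success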
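The abstract does not detail the machine-learning summarization method, so the following is only a generic extractive baseline (frequency-scored sentence selection) to illustrate the kind of processing involved, not the authors' approach.

    # Generic extractive summarizer (sketch; NOT the project's ML method).
    # Scores each sentence by average corpus-wide word frequency and keeps
    # the top n, preserving their original order.
    import re
    from collections import Counter

    def summarize(text, n_sentences=3):
        # Split on Arabic and Latin sentence-ending punctuation.
        sentences = [s.strip() for s in re.split(r"[.!?؟\n]+", text) if s.strip()]
        freq = Counter(re.findall(r"\w+", text.lower()))
        def score(sent):
            toks = re.findall(r"\w+", sent.lower())
            return sum(freq[t] for t in toks) / (len(toks) or 1)
        top = sorted(sentences, key=score, reverse=True)[:n_sentences]
        return [s for s in sentences if s in top]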
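For QNL's third challenge, one baseline, again an assumption rather than QNL's method, is a keyword and domain filter over fetched pages; the keyword list and threshold below are illustrative. A production archive would need a stronger classifier, but this shows the shape of the problem.

    # Baseline relevance filter for "Qatar related" pages (illustrative only).
    # A page passes if its host is under .qa or its text mentions enough
    # Qatar-related terms.
    import re

    KEYWORDS = ["qatar", "doha", "قطر", "الدوحة"]   # assumed seed terms

    def is_qatar_related(url, text, min_hits=3):
        if re.search(r"https?://[^/]*\.qa(/|$)", url):
            return True
        lowered = text.lower()
        hits = sum(lowered.count(k) for k in KEYWORDS)
        return hits >= min_hits

    print(is_qatar_related(
        "http://news.example.com/world/gulf",
        "Doha, Qatar will host the 2022 FIFA World Cup, and Qatar is "
        "preparing new stadiums."))   # True: three keyword hits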
Similar Resources
Introduction to the Web Archiving and Digital Libraries 2015 Workshop Issue
Our understanding of the past will, to a large extent, depend on our success with Web archiving. WADL 2015 brought together international leaders from industry, government, and academia, who are tackling this important challenge. This special issue includes summaries of twelve presentations on 24 June 2015. It is hoped that these works will stimulate other digital library (DL) and related inves...
Archiving and Analyzing Tweets and Webpages with the DLRL Hadoop Cluster
Sunshin Lee Dept. of Computer Science, Virginia Tech Blacksburg, VA 24061 USA [email protected] Edward A. Fox Dept. of Computer Science, Virginia Tech Blacksburg, VA 24061 USA [email protected] ABSTRACT In the Integrated Digital Event Archive and Library (IDEAL) [1] project we research the next generation integration of digital libraries and event archiving. The project team has been collecting Internet...
The Web-at-Risk at Three: Overview of an NDIIPP Web Archiving Initiative
The Web-at-Risk project is a multi-year National Digital Information Infrastructure and Preservation Program (NDIIPP) funded effort to enable librarians and archivists to capture, curate, and preserve political and government information on the Web, and to make the resulting Web archives available to researchers. The Web-at-Risk project is a collaborative effort between the California Digital L...
Digital Archiving in the Twenty-First Century: Practice at the National Library of the Netherlands
Research journals are increasingly being published digitally. The advantage of digital publishing is obvious: immediate accessibility anywhere. Gradually a disadvantage is also becoming clear: digital publishing endangers the continuity of research information. As a consequence of the obsolescence of formats, hardware, software, and carriers, digital information will be lost unless we act. Digi...
Journal: TCDL Bulletin
Volume 11
Published: 2015